About the presenter

Dianne Cook
Distinguished Professor
Monash University


🌐 https://dicook.org/

🦣 @visnut@aus.social

@visnut.bsky.social

  • I have a PhD in Statistics from Rutgers University, NJ, and a Bachelor of Science (Pure Mathematics, Statistics and Biochemistry) from University of New England

  • I am a Fellow of the American Statistical Association, an elected member of the R Foundation and the International Statistical Institute, and past Editor of the Journal of Computational and Graphical Statistics and of the R Journal.

  • My research is in data visualisation, statistical graphics and computing, with application to sports, ecology and bioinformatics. I like to develop new methodology and software.

  • Students in my lab work on methods and software that are generally useful for the world. They have been responsible for bringing you ggplot2, the tidyverse suite, knitr, plotly, and many other frequently used R packages.

Got a question, or a comment?



✋ 🔡 You can ask directly by raising your hand at any time.



I hope you have many questions! 🙋🏻👣

Outline

Follow along

Summary of materials

This workshop is designed to provide you with everyday tools to improve your data analysis efficiency and effectiveness. All the code and examples to reproduce everything discussed are available at https://dicook.github.io/BAPPENAS_2025/

What is “big data”?


Big data is extremely overhyped and not terribly well defined. Many people think they have big data, when they actually don’t. ~Hadley Wickham


  • Big data problems that are actually small data problems, once you have the right subset/sample/summary. ~90%
  • Big data problems that are actually lots and lots of small data problems, e.g. you need to fit one model per individual for thousands of individuals. ~9%
  • Finally, there are irretrievably big problems where you do need all the data, perhaps because you are fitting a complex model. ~1%

These proportions apply at least when you first tackle a data problem; after that you might scale up and automate operations.

Approach

The methods and tools discussed will ensure you can get started and have a process to follow to develop the appropriate analysis.

Tidy data

Using tidyr, dplyr

  • Writing readable code using pipes
  • What is tidy data? Why do you want tidy data? Getting your data into tidy form using tidyr.
  • Reading different data formats
  • String operations, working with text

The pipe operator %>% or |>

  • read as "then"
  • x %>% f(y) and x |> f(y) are the same as f(x, y)
  • %>% is part of the dplyr package (really, magrittr),
    |> is part of base R
  • pipes structure code as sequence of operations – as opposed to function order g(f(x))

The pipe operator %>% or |>

  • %>% is part of dplyr package (or more precisely, the magrittr package)
  • R 4.1 introduced the |> base pipe (no package necessary)
  • An explanation of the (subtle) differences between the pipes can be found here

Pipe Example

tb <- read_csv(here::here("data/TB_notifications_2025-07-22.csv"))
tb   %>%                                # first we get the tb data
  filter(year == 2023) %>%              # then we focus on the most recent year
  group_by(country) %>%                 # then we group by country
  summarize(
    cases = sum(c_newinc, na.rm=TRUE)   # to create a summary of all new cases
    ) %>%
  arrange(desc(cases))                  # then we sort countries to show highest number of new cases first
tb <- read_csv(here::here("data/TB_notifications_2025-07-22.csv"))
tb |>                                  # first we get the tb data
  filter(year == 2023) |>              # then we focus on the most recent year
  group_by(country) |>                 # then we group by country
  summarize(
    cases = sum(c_newinc, na.rm=TRUE)   # to create a summary of all new cases
    ) |> 
  arrange(desc(cases))                  # then we sort countries to show highest number of new cases first
# A tibble: 215 × 2
   country                            cases
   <chr>                              <dbl>
 1 India                            2382714
 2 Indonesia                         804836
 3 Philippines                       575770
 4 China                             564918
 5 Pakistan                          475761
 6 Nigeria                           367250
 7 Bangladesh                        302813
 8 Democratic Republic of the Congo  258069
 9 South Africa                      211810
10 Ethiopia                          134873
# ℹ 205 more rows

What is tidy data?

Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

  • What do we expect tidy data to look like?
  • maybe easier: what are sources of messiness?

Varying degree of messiness

What are the variables? Where are they located?

# A tibble: 6 × 4
  Inst                     AvNumPubs AvNumCits PctCompletion
  <chr>                        <dbl>     <dbl>         <dbl>
1 ARIZONA STATE UNIVERSITY      0.9       1.57          31.7
2 AUBURN UNIVERSITY             0.79      0.64          44.4
3 BOSTON COLLEGE                0.51      1.03          46.8
4 BOSTON UNIVERSITY             0.49      2.66          34.2
5 BRANDEIS UNIVERSITY           0.3       3.03          48.7
6 BROWN UNIVERSITY              0.84      2.31          54.6

What’s in the column names of this data? What are the experimental units? What are the measured variables?

# A tibble: 3 × 12
  id     `WI-6.R1` `WI-6.R2` `WI-6.R4` `WM-6.R1` `WM-6.R2`
  <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1 Gene 1      2.18     2.20       4.20     2.63       5.06
2 Gene 2      1.46     0.585      1.86     0.515      2.88
3 Gene 3      2.03     0.870      3.28     0.533      4.63
# ℹ 6 more variables: `WI-12.R1` <dbl>, `WI-12.R2` <dbl>,
#   `WI-12.R4` <dbl>, `WM-12.R1` <dbl>, `WM-12.R2` <dbl>,
#   `WM-12.R4` <dbl>

What are the variables? What are the records?

           V1   V2 V3   V4  V5  V9 V13 V17 V21 V25 V29 V33
1 ASN00086282 1970  7 TMAX 141 124 113 123 148 149 139 153
2 ASN00086282 1970  7 TMIN  80  63  36  57  69  47  84  78
3 ASN00086282 1970  7 PRCP   3  30   0   0  36   3   0   0
4 ASN00086282 1970  8 TMAX 145 128 150 122 109 112 116 142
5 ASN00086282 1970  8 TMIN  50  61  75  67  41  51  48  -7
6 ASN00086282 1970  8 PRCP   0  66   0  53  13   3   8   0
  V37 V41 V45 V49 V53 V57 V61 V65 V69 V73 V77 V81 V85 V89
1 123 108 119 112 126 112 115 133 134 126 104 143 141 134
2  49  42  48  56  51  36  44  39  40  58  15  33  51  74
3  10  23   3   0   5   0   0   0   0   0   8   0  18   0
4 166 127 117 127 159 143 114  65 113 125 129 147 161 168
5  56  62  47  33  67  84  11  41  18  50  22  28  74  94
6   0   0   3   5   0   0  64   3  99  36   8   0   0   0
  V93 V97
1 117 142
2  39  66
3   0   0
4 178 161
5  73  88
6   8  36

What are the variables? What are the experimental units?

# A tibble: 6 × 22
  iso2   year  m_04 m_514 m_014 m_1524 m_2534 m_3544 m_4554
  <chr> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 ZW     2003    NA    NA   133    874   3048   2228    981
2 ZW     2004    NA    NA   187    833   2908   2298   1056
3 ZW     2005    NA    NA   210    837   2264   1855    762
4 ZW     2006    NA    NA   215    736   2391   1939    896
5 ZW     2007     6   132   138    500   3693      0    716
6 ZW     2008    NA    NA   127    614      0   3316    704
# ℹ 13 more variables: m_5564 <dbl>, m_65 <dbl>, m_u <dbl>,
#   f_04 <dbl>, f_514 <dbl>, f_014 <dbl>, f_1524 <dbl>,
#   f_2534 <dbl>, f_3544 <dbl>, f_4554 <dbl>, f_5564 <dbl>,
#   f_65 <dbl>, f_u <dbl>

What are the variables? What are the observations?

            religion <$10k $10-20k $20-30k $30-40k
1           Agnostic    27      34      60      81
2            Atheist    12      27      37      52
3           Buddhist    27      21      30      34
4           Catholic   418     617     732     670
5 Don’t know/refused    15      14      15      11

10 week sensory experiment, 12 individuals assessed taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?), fried in one of 3 different oils, replicated twice.

First few rows:

# A tibble: 4 × 9
  time  treatment subject   rep potato buttery grassy rancid
  <fct> <fct>     <fct>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
1 1     1         3           1    2.9     0      0      0  
2 1     1         3           2   14       0      0      1.1
3 1     1         10          1   11       6.4    0      0  
4 1     1         10          2    9.9     5.9    2.9    2.2
# ℹ 1 more variable: painty <dbl>

What is the experimental unit? What are the factors of the experiment? What was measured? What do you want to know?

Messy data patterns

There are various features of messy data that one can observe in practice. Here are some of the more commonly observed patterns:

  • Column headers are not just variable names, but also contain values
  • Variables are stored in both rows and columns, contingency table format
  • One type of experimental unit stored in multiple tables
  • Dates in many different formats

Tidy Data Conventions

  1. Data is contained in a single table
  2. Each observation forms a row (no data info in column names)
  3. Each variable forms a column (no mashup of multiple pieces of information)

Long and Wide

  • Long form: one measured value per row. All other variables are descriptors (key variables) - good for modelling, terrible for most other analyses, e.g. correlation matrix

  • Widest form: all measured values for an entity are in a single row.

  • Wide form: measurements are arranged by some of the descriptors in columns (for direct comparisons)

Illustrations from the Openscapes blog: Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

Tidy verbs

  • pivot_longer: get information out of names into columns
  • pivot_wider: make columns of observed data for levels of design variables (for comparisons)
  • separate/unite: split and combine columns
  • nest/unnest: make/unmake variables into sub-data frames of a list variable
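separate/unite and nest/unnest get less attention in the examples that follow, so here is a minimal sketch; the visit column and its values are invented for illustration:

```r
library(tidyr)

df <- tibble::tibble(visit = c("A-1", "A-2", "B-1"))

# separate_wider_delim: split one column into several at a delimiter
df2 <- df |> separate_wider_delim(visit, delim = "-", names = c("site", "rep"))

# unite: the inverse, pasting columns back into one
df3 <- df2 |> unite("visit", site, rep, sep = "-")

# nest/unnest: collapse the rows for each site into a list-column of
# sub-data frames, then expand them back out again
nested   <- df2 |> nest(data = rep)
unnested <- nested |> unnest(data)
```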

Pivot to long form

data |> pivot_longer(cols, names_to = "name", values_to = "value", ...)
  • pivot_longer turns a wide format into a long format

  • two new variables are introduced (in key-value format): name and value

  • cols defines which variables should be combined

Pivoting: an example

# wide format
dframe
  id trtA trtB
1  1  2.5   45
2  2  4.6   35
# long format
dframe |> pivot_longer(trtA:trtB, names_to="treatment", values_to="outcome")
# A tibble: 4 × 3
     id treatment outcome
  <int> <chr>       <dbl>
1     1 trtA          2.5
2     1 trtB         45  
3     2 trtA          4.6
4     2 trtB         35  
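pivot_wider is the inverse operation; a sketch that rebuilds the wide format from the long form above:

```r
library(tidyr)

long <- tibble::tibble(
  id        = c(1L, 1L, 2L, 2L),
  treatment = c("trtA", "trtB", "trtA", "trtB"),
  outcome   = c(2.5, 45, 4.6, 35)
)

# names_from supplies the new column names, values_from the cell values
wide <- long |> pivot_wider(names_from = treatment, values_from = outcome)
```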

Variable Selectors

  • data |> pivot_longer(cols, names_to = "name", values_to = "value", ...)

  • cols argument identifies variables that should be combined.

  • Pattern selectors can be used to identify variables by name, position, a range (using :), a pattern, or a combination of these.

Examples of pattern selectors

  • starts_with(match, ignore.case = TRUE, vars = NULL)

  • other select functions: ends_with, contains, matches.

  • For more details, see ?tidyselect::language
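A quick sketch of the selector helpers on a toy tibble (the column names are invented to mimic the TB data below):

```r
library(dplyr)

df <- tibble::tibble(
  country = "AFG", year = 1997,
  new_sp_m014 = 0, new_sp_f014 = 5
)

df |> select(starts_with("new_sp_"))   # match by prefix
df |> select(contains("_m"))           # match by substring
df |> select(matches("f0\\d+$"))       # match by regular expression
df |> select(year:new_sp_f014)         # match by a range of positions
```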

TB notifications

New notifications of TB have the form new_sp_<sex><age group>:

read_csv(here::here("data/TB_notifications_2025-07-22.csv")) |> 
  dplyr::select(country, iso3, year, starts_with("new_sp_")) |>
  na.omit() |>
  head()
# A tibble: 6 × 23
  country     iso3   year new_sp_m04 new_sp_m514 new_sp_m014
  <chr>       <chr> <dbl>      <dbl>       <dbl>       <dbl>
1 Afghanistan AFG    2010          4         193         197
2 Afghanistan AFG    2012          0         188         188
3 Albania     ALB    2005          0           0           0
4 Albania     ALB    2006          1           4           5
5 Albania     ALB    2007          0           0           0
6 Albania     ALB    2009          0           0           0
# ℹ 17 more variables: new_sp_m1524 <dbl>,
#   new_sp_m2534 <dbl>, new_sp_m3544 <dbl>,
#   new_sp_m4554 <dbl>, new_sp_m5564 <dbl>,
#   new_sp_m65 <dbl>, new_sp_mu <dbl>, new_sp_f04 <dbl>,
#   new_sp_f514 <dbl>, new_sp_f014 <dbl>,
#   new_sp_f1524 <dbl>, new_sp_f2534 <dbl>,
#   new_sp_f3544 <dbl>, new_sp_f4554 <dbl>, …

Pivot Longer: TB notifications

Create two new variables: name and value

  • name contains all variable names starting with “new_sp_”
  • value contains all values of the selected variables
tb1 <- read_csv(here::here("data/TB_notifications_2025-07-22.csv")) |> 
  dplyr::select(country, iso3, year, starts_with("new_sp_")) |>
  pivot_longer(starts_with("new_sp_")) 

tb1 |> na.omit() |> head()
# A tibble: 6 × 5
  country     iso3   year name         value
  <chr>       <chr> <dbl> <chr>        <dbl>
1 Afghanistan AFG    1997 new_sp_m014      0
2 Afghanistan AFG    1997 new_sp_m1524    10
3 Afghanistan AFG    1997 new_sp_m2534     6
4 Afghanistan AFG    1997 new_sp_m3544     3
5 Afghanistan AFG    1997 new_sp_m4554     5
6 Afghanistan AFG    1997 new_sp_m5564     2

Separate columns

data |> separate_wider_delim(col, delim, names, ...)
  • split column col from the data frame into a set of columns as specified in names
  • delim is the delimiter (separator) at which col is split into columns

Separate TB notifications

Work on name:

tb2 <- tb1 |>
  separate_wider_delim(
    name, delim = "_", 
    names=c("toss_new", "toss_sp", "sexage")) 

tb2 |> na.omit() |> head()
# A tibble: 6 × 7
  country     iso3   year toss_new toss_sp sexage value
  <chr>       <chr> <dbl> <chr>    <chr>   <chr>  <dbl>
1 Afghanistan AFG    1997 new      sp      m014       0
2 Afghanistan AFG    1997 new      sp      m1524     10
3 Afghanistan AFG    1997 new      sp      m2534      6
4 Afghanistan AFG    1997 new      sp      m3544      3
5 Afghanistan AFG    1997 new      sp      m4554      5
6 Afghanistan AFG    1997 new      sp      m5564      2

Separate columns

data %>% separate_wider_position(col, widths, ...)

  • split column col from the data frame into a set of columns specified in widths
  • widths is a named numeric vector where the names become column names; unnamed components will be matched but not included.

Separate TB notifications again

Now split sexage into first character (m/f) and rest.

tb3 <- tb2 %>% dplyr::select(-starts_with("toss")) |> # remove the `toss` variables
  separate_wider_position(
    sexage,
    widths = c(sex = 1, age = 4),
    too_few = "align_start"
  )

tb3 |> na.omit() |> head()
# A tibble: 6 × 6
  country     iso3   year sex   age   value
  <chr>       <chr> <dbl> <chr> <chr> <dbl>
1 Afghanistan AFG    1997 m     014       0
2 Afghanistan AFG    1997 m     1524     10
3 Afghanistan AFG    1997 m     2534      6
4 Afghanistan AFG    1997 m     3544      3
5 Afghanistan AFG    1997 m     4554      5
6 Afghanistan AFG    1997 m     5564      2

Your turn

Read the genes data from folder data. Column names contain data and are kind of messy.

genes <- read_csv(here::here("data/genes.csv"))

names(genes)
 [1] "id"       "WI-6.R1"  "WI-6.R2"  "WI-6.R4"  "WM-6.R1" 
 [6] "WM-6.R2"  "WI-12.R1" "WI-12.R2" "WI-12.R4" "WM-12.R1"
[11] "WM-12.R2" "WM-12.R4"

Produce the data frame called gtidy as shown below:

head(gtidy)
# A tibble: 6 × 5
  id     trt   time  rep    expr
  <chr>  <chr> <chr> <chr> <dbl>
1 Gene 1 I     6     1      2.18
2 Gene 1 I     6     2      2.20
3 Gene 1 I     6     4      4.20
4 Gene 1 M     6     1      2.63
5 Gene 1 M     6     2      5.06
6 Gene 1 I     12    1      4.54

Plot the genes data overlaid with group means

gmean <- gtidy |> 
  group_by(id, trt, time) |> 
  summarise(expr = mean(expr))
gtidy |> 
  ggplot(aes(x = trt, y = expr, colour=time)) +
  geom_point() +
  geom_line(data = gmean, aes(group = time)) +
  facet_wrap(~id) +
  scale_colour_brewer("", palette="Set1")

Getting data into tidy form is the single most efficient and generalisable way to do data analysis

Handling missing values

naniar
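The slide names the package only; here is a sketch of naniar's most common entry points, using base R's airquality data (assuming naniar is installed):

```r
library(naniar)

# tabulate missing counts and percentages per variable
miss_var_summary(airquality)

# visual summaries: missingness per variable, and a heatmap of missing cells
gg_miss_var(airquality)
vis_miss(airquality)
```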

Reproducibility, workflow and versioning

Dynamic documents

  • Efficiency: allow changes to be implemented more easily, especially for dynamic reproducible documents.

  • Repeatability: the analysis can be repeated multiple times while still obtaining the same results.

  • Transparency: everything is available for access, resulting in more trustworthy results.

  • Easy to update: when new data arrives, the report can be automatically updated.

How might the project look?

How to combine text and data analysis?


Literate programming

Literate programming is an approach to writing reports using software that weaves together the source code and text at the time of creation.

Full reproducibility requires more than literate programming. It also needs:

  • a versioning and sharing system, like git and GitHub.
  • tools that manage the software environment and workflow, such as renv or targets.

Getting started

First step

Practicing

  1. Create a project.

  2. Create a quarto document.

  3. Render document.

Your turn!

Elements of a reproducible project

  • All the elements of the project should be files.
  • All files should be stored within the project location (typically a folder).
  • All files should be explicitly tied together.

But how do we tie the files together?

Computer paths

A path is the complete location or name of where a computer file, directory, or web page is located.

Examples:

  • Windows: C:\Documents\workshop
  • Mac/Linux: /Users/Documents/workshop
  • Internet: https://numbat.space/

Absolute and Relative paths

  • Absolute: starts from the root of the file system, typically a drive letter or / (root)
    • /Users/Documents/workshop ⚠️
  • Relative: refers to a location that is relative to the current directory.
    • ./workshop

Important

Absolute paths should be avoided since it is extremely unlikely another person will have the same absolute path as you.

Work projects

  • Data folder: contains all the data for the project.
  • Images/Figures folder: contains all the external pictures not produced by the code in the qmd file.
  • .Rproj file: automatically added when creating an RStudio project (handles the relative paths and working directories).
  • qmd file: quarto document
  • Other R scripts, etc…
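The here package, already used in the code examples above, resolves paths relative to the project root (the folder containing the .Rproj file), so the same code runs on any machine regardless of the current working directory:

```r
library(here)

here()                                            # the project root
here("data", "TB_notifications_2025-07-22.csv")   # a file relative to the root
```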

Quarto details

Quarto

  • Provides a framework for integrating code and text into a single document.

  • Write the code within code chunks, put the text around them, and you get a fully reproducible document.

Quarto document elements

  1. Text (formatted with Markdown)

  2. Code (code formatting)

  3. Metadata (YAML)

Quarto: text (Markdown)

Markdown is a lightweight markup language for adding formatting elements to plain text documents.

  • Text formatting
  • Headings
  • Links & Images
  • Lists
  • Many more…

Text formatting & Headings

Markdown Syntax:

*italics*, _italics_

**bold**, __bold__

***bold italics***, ___bold italics___

~~strikethrough~~

`verbatimcode`

# Heading 1

## Heading 2

Results:

italics, italics

bold, bold

bold italics, bold italics

strikethrough

verbatimcode

Heading 1

Heading 2

Quarto: code (R)

R code:

```{r}
#| echo: false

1+1
``` 

Results:

[1] 2

Insert an R code chunk into a Quarto document with:

  • Keyboard shortcut Ctrl + Alt + I (Mac: Cmd + Option + I)

  • Typing the chunk delimiters (```)

Chunk output can be customised with chunk execution options, which go at the top of a chunk and start with #|

Chunk execution options

  • eval: false does not evaluate (run) the code chunk when rendering.
  • echo: false does not show the source code in the finished file.
  • include: false prevents code and results from showing in the finished file.
  • message: false prevents messages generated by the code from showing in the finished file.
  • warning: false prevents warnings generated by the code from showing in the finished file.
  • fig-cap: "Text" adds a caption to a figure.

There are many more; see Quarto documentation.
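Combining several options: a chunk that hides its code, silences warnings, and captions the table it produces might start like this (the label and caption text are illustrative):

```r
#| label: tbl-cars
#| echo: false
#| warning: false
#| tbl-cap: "First rows of the cars data"

knitr::kable(head(cars))
```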

Tables and captions

R code:

```{r}
#| echo: false

library(ggplot2)

data(cars)

table_data <- head(cars, 5)

knitr::kable(table_data,
             caption = "Speed and stopping 
             distances of cars")
``` 

Results:

Speed and stopping distances of cars

| speed | dist |
|------:|-----:|
|     4 |    2 |
|     4 |   10 |
|     7 |    4 |
|     7 |   22 |
|     8 |   16 |

Figures and captions

R code:

```{r}
#| label: cars-plot
#| fig-cap: "Distance taken for a car to stop, against its speed during the test."

library(ggplot2)

ggplot(cars,
      aes(x = speed,
          y = dist)
      ) +
  geom_point()
``` 

Results:

library(ggplot2)

ggplot(cars,
      aes(x = speed,
          y = dist)
      ) +
  geom_point()

Distance taken for a car to stop, against its speed during the test.

Quarto: YAML

Basic YAML syntax

title: "My report"
author: "Krisanat A."
format:
  html:
    toc: true
    theme: solar
  pdf:
    toc: true

HTML result

PDF result

Your turn!

Practice

  1. Change the title
  2. Add your name as an author
  3. Use HTML and PDF format

Quarto templates

What is a quarto template?

Templates provide a straightforward way to get started with new Quarto projects, supplying example content and options.

  1. Create a working initial document for custom formats

  2. Provide the initial content for a custom project type

Remember all the painstaking work we did earlier: setting the YAML, creating the folders, and setting the execution options.

All of that can be set up with one line in the terminal!

📚 Adding a Bibliography to .qmd Files

1. Create a .bib File (BibTeX Format)

```bibtex
@article{smith2021,
  title = {A Method for Data Analysis},
  author = {Smith, Jane},
  journal = {Journal of Statistics},
  year = {2021}
}
```

💡 Save as: references.bib

Then point to it in the document YAML:

bibliography: references.bib
csl: apa.csl  # Optional: adds citation style (e.g., APA, IEEE)

2. Cite in the Text

As shown in recent work [@smith2021], the method is effective.

🔄 Quarto will automatically format the in-text citation and generate a reference list.

⚙️ Quarto Quirks & Power Tips

📐 LaTeX Equation in a Figure Caption

![This figure demonstrates the equation $E = mc^2$.](figure.png){#fig-einstein}

✅ You can embed LaTeX-style equations directly using dollar signs ($...$) in the caption.

✅ Works in both HTML and PDF outputs.

Refer to Another Figure in a Caption

![Here we compare this with @fig-einstein.](comparison.png){#fig-compare}

📝 @fig-einstein will resolve to the numbered figure reference (e.g., Figure 1) in PDF/HTML.

📌 Make sure the first figure has an ID like {#fig-einstein}.

| Tool   | Link                              |
|--------|-----------------------------------|
| Quarto | [quarto.org](https://quarto.org) |

✅ Use standard Markdown syntax inside tables: [text](url)

✅ Works for external links, local files, and section anchors.

🖼 Alt-Text in Figure Chunk

![Scatterplot of speed vs. distance](cars-plot.png){fig-alt="A scatterplot showing car stopping distances by speed."}

✅ Use fig-alt="..." to improve accessibility and support screen readers.

Resources